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Abstract 



We propose an ICA contrast based on the density estimation of the observed signal and its 
marginals by means of wavelets. The risk of the associated moment estimator is linked with 
approximation properties in Besov spaces. It is shown to converge faster than the at least 
expected minimax rate carried over from the underlying density estimations. Numerical 
simulations performed on some common types of densities yield very competitive results, 
with a high sensitivity to small departures from independence. 
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1. Introduction 



In signal processing, blind source separation consists in the identification of analogical, 
independent signals mixed by a black-box device. In psychometry, one has the notion of 
structural latent variable whose mixed effects are only measurable through series of tests; an 
example are the Big Five (components of personality) identified from factorial analysis by 
researchers in the domain of personality evaluation (Roch, 1995). Other application fields 
such as digital imaging, biomedicine, finance and econometrics also use models aiming to 
recover hidden independent factors from observation. Independent component analysis 
(ICA) is one such tool; it can be seen as an extension of principal component analysis, in 
that it goes beyond a simple linear decorrelation only satisfactory for a normal distribution; 
or as a complement, since its application is precisely pointless under the assumption of 
normality. 



Papers on ICA are found in the research fields of signal processing, neural networks, statis- 
tics and information theory. Comon (1994) denned the concept of ICA as maximizing the 
degree of statistical independence among outputs using contrast functions approximated 
by the Edgeworth expansion of the Kullback-Leibler divergence. 

The model is usually stated as follows: let x be a random variable on R d , d > 2; one tries 
to find couples (A, s), such that x = As, where A is a square invertible matrix and s a 
latent random variable whose components are mutually independent. This is usually done 
through some contrast function that cancels out if and only if the components of Wx are 
independent, where W is a candidate for the inversion of A. 



Maximum-likelihood methods and contrast functions based on mutual information or other 
divergence measures between densities are commonly employed. Cardoso (1999) used 
higher-order cumulant tensors, which led to the Jade algorithm, Bell and Snejowski (1990s) 
published an approach based on the Infomax principle. Hyvarinen and Oja (1997) presented 
the fast ICA algorithm. 

In the semi-parametric case, where the latent variable density is left unspecified, Bach and 
Jordan (2002) proposed a contrast function based on canonical correlations in a reproducing 
kernel hilbert space. Similarly, Gretton et al (2003) proposed kernel covariance and kernel 
mutual information contrast functions. 

The density model assumes that the observed random variable X has the density /a given 
by 

f A (x) = \detA- 1 \f(A- 1 x) 

= \detB\f 1 (b 1 x)...f d (b d x), 

where bi is the I th row of the matrix B = A~ x \ this resulting from a change of variable if 
the latent density / is equal to the product of its marginals f 1 . . . f d . In this regard, latent 
variable s = (s 1 , . . . , s d ) having independent components means the indepence of the random 
variables s e o n e defined on some product probability space fl = JJ Q e , with n e the canonical 
projections. So s can be defined as the compound of the unrelated s 1 ,. . . , s d sources. 

Tsybakov and Samarov (2002) proposed a method of simultaneous estimation of the di- 
rections bi, based on nonparametric estimates of matrix functionals using the gradient of 

fA- 

In this paper, we propose a wavelet based ICA contrast. The wavelet contrast Cj compares 
the mixed density /a and its marginal distributions through their projections on a multires- 
olution analysis at level j. It thus relies upon the procedures of wavelet density estimation 
which are found in a series of articles from Kerkyacharian and Picard (1992) and Donoho 
et al. (1996). 

As will be shown, the wavelet contrast has the property to be zero only on a projected 
density with independent components. The key parameter of the method lies in the choice 
of a resolution j, so that minimizing the contrast at that resolution yields a satisfactory 
approximate solution to the ICA problem. 

The wavelet contrast can be seen as a special case of quadratic dependence measure, as 
presented in Achard et al. (2003), which is equal to zero under independence. But in our 
case, the resolution parameter j allows more flexibility in controlling the reverse implication. 
Let's mention also that the idea of comparing in the L 2 norm a joint density with the product 
of its marginals, can be traced back to Rosenblatt (1975). 

Besov spaces are a general tool in describing smoothness properties of functions; they also 
constitute the natural choice when dealing with projections on a multiresolution analysis. 
We first show that a linear mixing operation is conservative as to Besov membership; after 
what we are in position to derive a risk bound that will hold for the entire ICA minimization 
procedure. 

Under its simplest form, the wavelet contrast estimator is a linear function of the empirical 
measure on the observation. We give the rule for the choice of a resolution level j minimizing 
the risk, assuming a known regularity s for a latent signal in some Besov space B spq . 
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The estimator complexity is linear in the sample size but exponential in the dimension 
d of the problem; this is on account of an implicit multivariate density estimation. In 
compensation to this computational load, the wavelet contrast shows a very good sensitivity 
to small departures from independence, and encapsulates all practical tuning in a single 
parameter j. 



2. Notations 



We set here the main notations and recall some definitions for the convenience of ICA 
specialists. The reader already familiar with wavelets and Besov spaces can skip this part. 

■ Wavelets 

Let <p be some function of L 2 (R) such that the family of translates {<p(. — k), k e Z} is an 
orthonormal system; let Vj c L 2 (R) be the subspace spanned by {(pj k = 2^ 2 (p(2 j . - k), k e Z}. 

By definition, the sequence of spaces (Vj),j e Z, is called a multiresolution analysis (MRA) 
of L 2 (R) if Vj c Vj+i and Uj>o Vj ^ s dense in L 2 (R); <f is called the father wavelet or scaling 
function. 

Let (Vj)j e % be a multiresolution analysis of L 2 (R), with Vj spanned by {ipj k = 2^ 2 ^p{2? . — 
k),k e Z}. Define Wj as the complement of Vj in Vj + i, and let the families {ipjk, k £ Z} be 
a basis for Wj, with ip jk (x) = 2 j / 2 tp(2 j x - k). Let a jk (f) =< f,ip jk > and Pjk(f) =< f,ipjk >■ 

A function / e L 2 (M) admits a wavelet expansion on (Vj)j & z if t ne series 

oo 

k j=jo k 

is convergent to / in L 2 (R); ip is called a mother wavelet. 

The definition of a multiresolution analysis on L 2 (R d ) follows the same pattern. But an 
MRA in dimension one also induces an associated MRA in dimension d, using the tensorial 
product procedure below. 

Define Vf as the tensorial product of d copies of Vj. The increasing sequence (Vf)^^ defines 
a multiresolution analysis of L 2 (M. d ) (Meyer, 1997): 

for (i 1 ...,^) e {0,l} d and {i l . . . ,i d ) ji (0....0), define ^{x) lK ^ ld = Y[ d l=1 i>^\x e ), with 
ip(°) — ip, ipW — -0, so that ip appears at least once in the product ^(x) (we now on omit 
i 1 . . . ,i d in the notation for \P, and in (1), although it is present each time); 

for (i 1 . . . , i d ) = (0 . . . , 0), define $(x) = ULi <P&)\ 

for j e Z, k e Z d , x e R d , let V jk {x) = 2*T$(2ix - k) and ® jk {x) = 2^$(2^ - k); 

define Wf as the orthogonal complement of V d in V d +1 ; it is an orthogonal sum of 2 d — 1 
spaces having the form Uij . . . <g> Udj, where U is a placeholder for V or W; V or W are thus 
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placed using up all permutations, but with W represented at least once, so that a fraction of 
the overall innovation brought by the finer resolution j + 1 is always present in the tensorial 
product. 

A function / admits a wavelet expansion on the basis ($, \&) if the series 

oo 

E «iofc(/)*jok + E E (!) 

feeZ d J=jo ke1 d 

is convergent to / in L 2 (M. ). 
In fact, with the concentration condition 

J2\v(x + k)\<Ca.s., (2) 
k 

verified in particular for a compactly supported wavelet, any function in Li(M d ) admits a 
wavelet expansion. Otherwise any function in a Besov space B spq (R d ) admits a wavelet 
expansion. 

In connection with function approximation, wavelets can be viewed as falling in the category 
of orthogonal series methods, or also in the category of kernel methods. 

The approximation at level j of a funtion / that admits a multiresolution expansion is the 
orthogonal projection Pjf of / onto Vj c L 2 {M. d ) defined by: 

(Pjf)(x) = E <*jk*jk(x), 
ke1 d 

where a jk = a jk i_ tk a = J f(x)$ jk {x)dx. 

With the concentration condition above, the projection operator can also be written 

(Pjf)(x)= I K ] {x,y)f{y)d{y) 1 

with Kj(x,y) = 2 jd J2ke% d ®jk( x ~ k)$jk(y - k). Kj is an orthogonal projection kernel with 
window 2~ jd (which is not translation invariant). 

■ Besov spaces 

Let / be a function in L p (R d ) and h e R d . Define the first order difference Ahf by Ahf(x) = 
f(x+h) - f{x) and the k th order difference A k J = A^A^ 1 / (k = 1, 2, . . . with A°J = /, A}J = 
A fe /). 

The modulus of continuity of order k of / in the metric of L p , along direction h, is defined 
by (Nikol'skii, 1975, p. 145-160) 

u3 k h (f,6) p = sup\\Ay(x)\\ p , 6>0, \h\ = l. 
\t\<s 
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The modulus of continuity of order k of / in the direction of the subspace M m c R d is 
defined by 

ttR"i(f,5) p = sup w£(/,J) p . 

\h\=iMeR m 

If the function / has arbitrary derivatives of order g relative to the first m coordinates, one 
can define, for ft e M m , 

|n|=e 

with ft = (fti,...,ft m ,0,...,0), |ft| = 1, \n\ = J2T n i and h " = h T ■■■K? = ft™ 1 .ft" m .. .0°. 
The modulus of continuity of order k of the derivatives of order g of / is then defined by 

dk<»{f {e \S)v= sup uj k h (fi e \S) p =J2^(f {n) ,S) p . 
|/»|=i,heR m |„| =B 



Let s = [s] + a; the Holder space £fp(lR d ) is defined as the collection of functions in L p (R d ) 
such that 

d 

\\A h fW\\ p <M\h\ a , Vn=(n 1 ,...,n d ), with |n| = = [a], 

1 

or equivalents, n Rd (f^\S) p = sup uj h (f^\S) p < M5 a , 

heR d 

where M does not depend on ft. 

Besov spaces introduce a finer scale of smoothness than is provided by Holder spaces. For 
each a > this can be accomplished by introducing a second parameter q and applying (a, 
q) quasi-norms (rather than (a, oo)) to the modulus of continuity of order k. 

Let s > and (f>, fc) forming an admissible pair of nonnegative integers satisfying the in- 
equalities k > s - g > 0. By definition, / e belongs to the class B spq (R d ) if there exist 
generalized partial derivatives of / of order n = (n 1 , . . . , n d ), \n\ < g, and one of the following 
semi-norms is finite: 



^(/)= E (f\t- {s - e) n k Rd (.f {n) ,t) p 

\n\=g 

^ 9 (/)=( j ( 00 |i- (S - e) ^(.f (e) ,^| 9 f) 



(3) 



For fixed s and p, the space B spq gets larger with increasing q. In particular, for g = oo, 
B spq (M) = Zf* (R) ; various other embeddings exist since Besov spaces cover many well known 
classical concrete function spaces having their own history. 

Finally, Besov spaces also admit a characterization in terms of wavelet coefficients, which 
makes them intrinsically connected to the analysis of curves via wavelet techniques. 
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/ belongs to the (inhomogeneous) Besov space B spq (R d ) if 

i 

Q 

< oo, (4) 
with s > 0, 1 <p < oo, 1 < q < oo, and <p,tp eC r ,r > s (Meyer, 1997). 

A more complete presentation of wavelets linked with Sobolev and Besov approximation 
theorems and statistical applications can be found in the book from Hardle et al. (1998). 
General references about Besov spaces are Peetre (1975), Bergh & Lofstrom (1976), Triebel 
(1992), DeVore & Lorentz (1993). 



3. Wavelet contrast, Besov membership 

Let / be a density function with marginal distribution in dimension £, 

x 1 ^ ( f(x 1 ...,x d )dx 1 ...dx e - 1 dx e+1 ...dx d , 

JR d - 1 

denoted by f* e . 

As integrable positive functions, / and the f* e admit a wavelet expansion on a basis (<p, ip) 
verifying the concentration condition (2). One can then consider the projections up to 
order j, that is to say the projections of / and f* e on Vf and Vj respectively, namely 

p jf( x ) = J2 a 3k®jk{x) and Fjf* l (a?) = J2 a 3 k tl Pjk^x e ), 
ke1 d k e eZ 

where aj k e = J f* l {x l )ipj k i (x e ) dx i and ctj k = ctjk\...k d = / f( x )&jk( x )dx. 
Proposition 3.1 (wavelet contrast) 

Let f be a density function on M. d and let tp be the scaling function of a multiresolution analysis 
verifying the concentration condition (2). 

Define the contrast function 

Cj(f) = E ( a jk\..M ~ a jk 1 ■ ■ ■ OLjkd) 2 , 

k\..,k d 

with a jk i = j m f* t {x t )tp :jk t{x l )dx t and a jk i_ tk d = / B<J f(x)® jk i t _ ik d(x) dx. 
The following relation hold: 

f factorizable ==>- Cj(f) = 0. 
If f and ip are compactly supported or else if f 6 L 2 (R d ), the following relation hold: 

d 

cj(f) = o Pjf = n^r e . 

i=i 
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Proof 

As for the first assertion, with / = f 1 ...f d , one has f* e = f e , t = l,...d. Whence for 
k = (k 1 , . . . , k d ) e Z d , one has by Fubini theorem, 

«ifc(/) = aMr 1 ■ ■ ■ r d ) = L r 1 ■ ■ ■ r^Mdx 

J JR. 

= I r^jk^dx 1 ... f r d < Pjk «(x d )dx = a jkl (t 1 )...a jkd (r d ). 

For the second assertion, Cj = entails aj k (f) = a jk i(f* 1 ) . . . a jk d(f* d ) for all k e Z d . So 
that for Pjf e L p (R d ), 

p jf = ^ a 3k(f)$jk =J2a jk i(f 1 )tp jk i...a jk<1 (f d )<p jk< i 

k k 
k 1 k d 

= pjr i ...pfr d , 

with passage to line 2 justified by the fact that (ctjk(f)^jk) ke z d ^ s a summable family of 
L 2 (M. d ) or else is a finite sum for a compactly supported density and a compactly supported 
wavelet. 
□ 

For the zero wavelet contrast to give any clue as to whether the non projected difference 
/ - f* 1 . . . f* d is itself close to zero, a key parameter lies in the order of projection j. 

Under the notations of the preceding proposition, with a zero wavelet contrast and assuming 
existence in L p , one has \\Pjf - Pjf* 1 ■ ■ ■ Pff* d \\p = 0, and so 

ii / - r 1 . . . r d lip < ii/ - Pjf\\p + wp^r 1 . . . pfr d - r 1 . . . r% 
= \\f-p j f\\ P +\\p j (r l ...r d )-r l ...r d \\ P . 

If we now impose some regularity conditions on the densities, in our case if we now require 
that / and the product of its marginals belong to the (inhomogeneous) Besov space B spq (R d ), 
the approximation error can be evaluated precisely. With a r-regular wavelet ip, r > s, the 
very definition of Besov spaces implies for any member / that (Meyer, 1997) 

\\f-P J f\\ p = 2-^e 3 , { ej }ee q (N d ). (5) 

Remark 

In the special case where /a and the product of its marginals belong to L 2 {M d ), Parseval 
equality implies that Cj is equal to the square of the L 2 norm of Pjf a - Pjf "a ■ ■ ■ Pjf*A- 
one can write, 

Cj{fA)^ = \\Pj{f A 1 ...f A d )-PjfA\\2 

< \\fA Pjf A\\2 + \\fA /a • • • fl% + \\Pj(fA ■ ■ ■ fA) fA ■ ■ ■ fX%, 
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hence with notation K+(A, /) = \\f A - . . . p A d \\ 2 , 

\K*(A, f) Cj(f A )i\ < \\f A PjUh + \\PjifX 1 . . . ff) ff . . . f A %, (6) 

which gives another illustration of the shrinking with j distance between the wavelet con- 
trast and the true norm evaluated at f A . In particular when A ^ I, Cj(f A ) cannot be small 
and for A = I, C 3 must be small, for j big enough. 

Continuing on the special case p = 2, the wavelet contrast can be viewed as an example of 
quadratic dependence measure as presented in the paper from Achard et al (2003). 

Using the orthogonal projection kernel associated to the function ip, one has the writing 
Cj(fA)=j Rd ^ AKj ( X ,Y)-f[ElK}( X \Y^ dx, 

with Kj(x,y) = 2 jd J2keZ d ®]k{x - k)<5> ]k {y - k) and Ki(x,y) = 2^J2keZ - ^Wjkiy 1 - k l ). 

This is the form of the contrast in the paper from Achard et al. (2003), except that in 
our case the kernel is not scale invariant; but the ICA context is scale invariant by feature, 
since the inverse of A is conventionally determined up to a scaling diagonal or permutation 
matrix, after a whitening step. 



To take advantage of relation (5) in the ICA context, we need a fixed Besov space containing 
the mixed density f A and the product of its marginals, for any invertible matrix A. 

The two following propositions check that the mixing by A is conservative as to Besov 
membership, and that the product of the marginals of a density / belongs to the same 
Besov space than /. It is equivalent to assume that / is in B spq (R d ) or that the factors /' 
are in B spq (R). If the factors have different Besov parameters, one can theoretically always 
find a bigger Besov space using Sobolev inclusions 

B s , pql C B spq for s' > s, q' < q; 

B spq c B s > p i q for p < p' and s' = s + d/p' - d/p. 



Proposition 3.2 (Besov membership of marginal distributions) 

/// is a density function belonging to B spq (R d ) then each of its marginals belong to B spq (M.). 
proof 

Let us first check the L p membership of the marginal distribution. For p > 1, by convexity 
one has, 

/ \f A \ p dx= f f \f A \ p dx* e dx e > [ f f A dx* (P dx e = [ \f* A l \ p dx l ; 
that is to say < ||/a|| p . 



With the £ th canonical vector of R d denoted by e l and for t e R, one has, 
Af/*V) = £(-l) i+fe ^/*'(z + *) = E(-l)' +fe ^ / /(a: + te £ )^* £ - / A* e< /(z)c^; 

' ' " /Trod— 1 /irad— 1 

; n ; n IK. •'IK. 



so that 



and 



dx l < J^\A k tel f{x)\ P dx < \\A k tee f\\l A 



^{.r\5) p = sup \\Air\\ Lpm < snp ||A t V|| (Rd) = ^(/,<5) p , 

|t|<<5 |t|<<5 

and 

n fc (r / ,<y) P = w fc (r / ,<y) P <^(/,<y) P < su P w£(/,*) P = 

h|=i,/ieR d 

Using the admissible pair (k, g) = ([s] + 1,0), one can see from (3) that J' spq (f* e ) < J' spq (f)- 
□ 



Next, we check that the mixed density f A belongs to the same Besov space than the original 
density /. 



Proposition 3.3 (Besov membership of the mixed density) 

Let f = f 1 . . . f d andf A (x) = | det A~ 1 \f(A~ 1 x). 

(a) if f G L p (R d ), or if each f e belongs to L p (R), then f A and the product Y\f A belong to L p (R d ). 

(b) f and f A have same Besov semi-norm up to a constant. 

Hence f and f A belong to the same (inhomogeneous) Besov space B spq . 

proof 

For (a), with p > 1, as in Prop. 3.2 above, one has < ||/a|| p - Also, with the 

determinant of A denoted by \A\, 

\\f A \\ p = \A\-*> f \f(A- 1 x)fd x = \A\-P [ \f(x)\>\A\dx = \A\ 1 -*\\f\\ p . 



And finally by Fubini theorem, ||/|| ip(R£i) = Wfh^ . . . \\f d \\ Lp (R), so that / G L p (R d ) 
f £ L p (R),£=l...d . 

For (b), with a change of variable in the integral one has, 



\A th f A \\ p = \A\- 1+ ^ \\A tA -i h f\\ p : 



so that 



u h (fA,S) p = sup \\A th f A \\ p = \A\ 1+ p sup \\A th f\\ p = ui(f,S\A 1 h\) p , \h\ = 1; 

|t|<«,|h|=l |t|<«|A-i/i|,|/i|=l 



and 

n Rd (f A ,S) p = lAr^Q^if^lA^hDp, \h\ = 1. 
Next, with the change of variable u = 

POO J i fOC J 

/ \t- a n(f A ,t^- = (\A\- i+ i\A- i hrr / i«-^(/ )U )i«— , \h\ = i 

Jo t Jo u 

< (\a\- i+1 p wA-'ry r '\u- a mu)\ q -. 

Jo u 

In view of (3), using the admissible pair (k,g) = ([s] + 1, [s]) yields the desired result when 
< s < 1. 

When 1 < s, with the same admissible pair (k, g) — ([s] + 1, [s]), and by recurrence, since 
df A (h) = \A~ 1 \df(A~ 1 h) o A^ 1 one can see in the same way that the modulus of continuity 
of the (generalized) derivatives of f A or order k are bounded by those of /. 

Note that if A is whitened, in the context of ICA, the norms of / and f A are equal, at least 
when s < 1. □ 



4. Risk upper bound 

Define the experiment £ n = {X® n , A® n , (X u ..., X n ), Pf A , f A e B spq ), where X u ...,X n is 
an iid sample of X = AS, and P? = P$ A . . . <g> Pf A is the joint distribution of {X\ . . . , X n ). 
Likewise, define PJ as the joint distribution of (Si ... , S n ). 

The coordinates aj k in the wavelet contrast are estimated as usual by, 

_^ n 1 n 

a ok \..,k d = ~ ^' fel ■ ■ ■ ^ fed and & 'J kl = ~ ^ i=1 >--- d - ( 7 ) 

n i=i n i=i 

The linear wavelet contrast estimator is given by, 

C j (x 1 ...,x n )= Y ( & jk\..,ki -ot ]k i ...a ]k d) 2 = ^2 ^ 2 fe ; ( 8 ) 

k\..,k d keZ d 

where we define Sj k as the difference ajk\..,k d — ^jk 1 ■ ■ ■ a jk d. 

The estimator a jk is unbiased under EJ A , but so is not a ]k i . . .a jkd unless A = I, although 
it is asymptotically unbiased. 

We also make the assumption that both the density and the wavelet are compactly sup- 
ported so that all sums in k are finite. For simplicity we further suppose the density support 
to be the hypercube, so that J2k 1 ~ 

To bound the risk we use a single lemma whose proof relies on a classical U-statistic lemma, 
namely the connection between a U-statistic and its associated Von Mises statistic. To fit 
our purpose the U-statistic lemma first needed to be adapted to kernels that are (generally 
unsymmetric) products of $ jfe and tp jk o n e , and thus depend on the resolution parameter 
j. This is done in lemmas 7.2 appearing in the Appendix. 
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Lemma 4.1 (Moments of wavelet coefficients estimators) 

Let p, a > be fixed integers; the following relations hold: 

E f A & jk( a 3ki ■ ■ ■ & 3 k*Y = a p jk (a jk i . . . a 3kd f + 0{n- v ). 
And in corollary, EJ A 5 p k = 5 p - k + 
proof 

To the statistic V nj = a P k (a jk i ...a jk d) cr corresponds a U-statistic U n j with unsymmetric 
kernel 

hj{Xl, X p+da ) = <& jk {Xl) . . . ^ jk (Xp)ip jkei O TT ei (x p+1 ) . . . if jk l da O TT ld '(x p+da ), 

with ir l the canonical projection on component £, • ■ • = (b • • • , d), i = . . . d- 1 

and $(ar) =I\Li ( P°^( x )- 

By application of lemma 7.2 in Appendix, 

\E] A a p k (a 3kl . ..& jk *Y - aP(a 3kl . ..a jkd f\ < E]jV n3 - U nj \ = O^" 1 ) 

□ 

We now express a risk bound for the wavelet contrast estimator. In particular we show 
that the bias of the estimator is of the order of C2^ d /n. This is better than the convergence 
rate of n~ for the empirical Hilbert-Schmidt independence criterion (Gretton et al. 2004, 
theorem 3), except that in our case the resolution parameter j must still be set to some 
value, especially to cope with the antagonist objectives of reducing the estimator bias and 
variance. 

Proposition 4.4 (Risk upper bound for Cj) 

The quadratic risk E r j A (Cj — Cj) 2 and the bias E^ A Cj — Cj have convergence rate 2^ d O{l/n). 

In corollary, the variance E^ A ^Cj — Ej A Cj*j has convergence rate 2? d O{\jn) and the quadratic 
risk around zero is E'^Jj 2 = C? + 2 ld O(l/n). 

proof 

The risk about Cj is written, 

E] A (C 3 C 3 ) 2 = E] A J2(S% 5%)(S% 6%) (9) 

k,e 

For the squared terms where k = I , lemma 4.1 yields directly E^ A (8 2 k - 8 2 k ) 2 = 0(n _1 ), so 
that the corresponding risk component is bounded by C2 jd n~ 1 . 
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For crossed terms where k ^ £, observe that with a compactly supported Daubechies Wavelet 
D2N, whose support is [0,2N — 1], only a thin band of terms around the diagonal is non 
zero: 

VjkVji = 0, for \£ - k\ > 2N - 1. 

When k is fixed, the cardinal of the set \£ — k\ < 2N - 1 is lower than (AN) d ; hence, by 
Cauchy-Schwarz inequality and lemma 4.1, 

e? a - - 4) < E - 6 %) 2 E WM- m h 

k,t k \e-k\<2N-l 

< ^Cn- 1 ' 2 {AN) d Cn- 1 ' 2 = C2 jd n~ 1 . 

The bias convergence rate is a direct consequence of lemma 4.1, and the two remaining 
assertions follow from the usual relations, E n fA (C) - Cj) 2 = E n fA (Cj - E^Cj) 2 + {E'f A C 3 - Cj) 2 ; 

and E] A C] = (E- A C 3 ) 2 + EJJC 3 E^C,) 2 . 
□ 



We now give a rule for choosing the resolution j minimizing the (about zero) risk upper 
bound. This rule, obtained as usual through bias-variance balancing, depends on s, the 
unknown regularity of /, supposed to be a member of some Besov space B spq . The associated 

convergence rate improves upon the minimax of the underlying density estimations 

(see Kerkyacharian & Picard, 1992). 

Proposition 4.5 (minimizing resolution in the class B s2o c) 

Assume that f belongs to i? s2oo (K <i ), and that Cj is based on a r -regular wavelet ip, r > s. 

The minimizing resolution j is such that 2 J w n 4s + d and ensures a quadratic risk converging to zero 

_ 4s 

at rate n 4s + d . 
proof 

By Prop. 3.2 and 3.3 we know that Ja and the product of its marginal distributions belong 
to the same Besov space than the original /, so that equation (6) becomes 

\K i ,(A,f)-C j (f A )i\<K2-^; (10) 

with K a constant. 

Taking power 4 of (10) and using prop. 4.4, 

R(dj,f A ) + K*Q{CJ,K*) < K2-^ s + V d Kn- x , 

with K a placeholder for an unspecified constant, Q{a,b) = -4a 3 + 6a 2 b - Aab 2 + b 3 , and R 
denoting the quadratic risk around zero. 

When A is far from I, the constant is strictly positive and the risk relative to zero has 
no useful upper bound. Although the risk relative to Cj is always in 2 id Kn~ 1 . 
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With A getting closer to /, is brought down to zero and the bound is minimum when, 
constants apart, we balance 2~ 4:,s with 2 jd n" 1 , or 2^ d+4s ^ with n. 

1 —4s 

This yields 2 J = 0(n :n + 3 ) and convergence rate n 35 ^ for the risk relative to zero under 
independence and also for the risk relative to Cj by substitution in the expression given by 
Prop. 4.4. 
□ 



Corollary 4.1 (minimizing resolution in the class B spq ) 

Assume that f belongs to i? spg (R d ) ; and that Cj is based on a r-regular wavelet ip, r > s' . 

The minimizing resolution j is such that 2- 7 w n is '+ d , with s' = s + d/2 — d/p if 1 < p < 2 and s 1 = s 
ifp>2. 

is' 

This resolution ensures a quadratic risk converging to zero at rate n 4 ='+<* . 
proof 

If 1 < p < 2, using the Sobolev embedding B spq c B s > p > q for p < p' and s' = s + d/p' — d/p, one 
can see that /a belongs to B s i 2q with s' = s + d/2 - d/p, and so by definition, with {e 3 } e £ q , 

||/A-^A|| 2 <^2-^+ d / 2 - d /P). 
If p > 2, since we consider compactly supported densities, with {ej} e £ q , 

WfA-PjfAh^WfA-PjfAWp^e^'. 

Finally with s' as claimed, equation (6) yields again|i ; i'^( J 4, /) - Cj(fA)^ \ < K2~ js ' . □ 



5. Computation of the estimator Cj 

The estimator is computable by means of any Daubechies wavelet, including the Haar 
wavelet. 

For a regular wavelet (D2N, N > 1), it is known how to compute the values (pjk(x) (and any 
derivative) at dyadic rational numbers (Nguyen and Strang, 1996); this is the approach we 
have adopted in this paper. 

Alternatively, using the customary filtering scheme, one can compute the Haar projection at 
high j and use a discrete wavelet transform (DWT) by a D2N to synthetize the coefficients 
at a lower, more appropriate resolution before computing the contrast. This avoids the 
need to precompute any value at dyadics, because the Haar projection is like a histogram, 
but adds the time of the DWT. 

While this second approach usually gives full satisfaction in density estimation, in the ICA 
context, without special care, it can lead to an inflation of computational resources, or a 
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possibly inoperative contrast at minimization stage. Indeed, for the Haar contrast to show 
any variation in response to a small perturbation, j must be very high regardless of what 
would be required by the signal regularity and the number of observations; whereas for 
a D4 and above, we just need to set high the precision of dyadic rational approximation, 
which present no inconvenience and can be viewed as a memory friendly refined binning 
inside the binning in j. 

We have then chosen the approach with dyadics for simplicity at the minimization stage 
and possibly more accurate solutions. 

Also for simplicity, in all simulations that follow we have adopted the convention that 
the whole signal is contained in the hypercube [0, l]® d , after possible rescaling. For the 
compactly supported Daubechies wavelets (Daubechies, 1992), D2N,N = 1,2,..., whose 
support is [0, 2N - 1], the maximum number of k intersecting with an observation lying in 
the cube is {V + 2N - 2) d . 

Note that relocation in the unit hypercube is not a requirement, but otherwise a sparse 
array implementation should be used for efficiency. 



Sensitivity of the wavelet contrast 



In this section, we compare independent and mixed D2 to D8 contrasts on a uniform 
whitened signal, in dimension 2 with 100000 observations, and in dimension 4 with 50000 
observations. According to proposition 4.5, for s = +oo the best choice is j — 0, to be 
interpreted as the smallest of technically working j, i.e. satisfying 2 j > 2N - 1, to ensure 
that the wavelet support is mostly contained in the observation support. 

For j = 0, there is only one cell in the cube and the contrast is unable to detect any 
mixing effect: for Haar it is identically zero, and for the others D2N it is a constant (quasi 
for round-off errors) because we use circular shifting if the wavelet passes an end of the 
observation support. At small j such that 2 < 2 3 < 2N - 1, D2N wavelets behave more or 
less like the Haar wavelet, except they are more responsive to a small perturbation. We 
use the Amari distance as defined in Amari (1996) rescaled from to 100. 

In this example, we have deliberately chosen an orthogonal matrix producing a small Amari 
error (less than 1 on a scale from to 100), pushing the contrast to the limits. 



j 


D2 indep 


D2 mixed 


cpu 








000E+00 





000E+00 





12 


1 





184E-06 





102E-10 





06 


2 





872E-04 





199E-04 





06 


3 





585E-03 





294E-03 





06 


4 





245E-02 





285E-02 





06 


5* 





926E-02 





110E-01 





07 


6 





395E-01 





387E-01 





07 


7 





162E+00 





162E+00 





07 


8 





651E+00 





661E+00 





08 


9 





262E+01 





262E+01 





12 


10 





105E+02 





105E+02 





23 


11 





419E+02 





419E+02 





69 


12 





168E+03 





168E+03 


2 


48 



j 


D4 indep 


D4 mixed 


cpu 








250E+00 





250E+00 





21 


1* 





239E+00 





522E+00 





17 


2 





198E-04 





209E-04 





17 


3 





127E-03 





159E-03 





17 


4 





635E-03 





714E-03 





17 


5 





235E-02 





282E-02 





17 


6 





988E-02 





105E-01 





17 


7 





405E-01 





419E-01 





17 


8 





163E+00 





165E+00 





21 


9 





653E+00 





653E+00 





26 


10 





261E+01 





262E+01 





39 


11 





104E+02 





105E+02 





87 


12 





419E+02 





420E+02 


2 


67 



Table la. Wavelet contrast values for a D2 and a D4 on a uniform density in dimension 2 under a half degree rotation 

Amari error s=a .8, nobs=100000, L=10, 
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D6 indep 


D6 mixed 


cpu 








304E+00 





304E+00 





37 


1 





304E+00 





305E+00 





37 


2* 





215E+00 





666E+00 





37 


3 





132E-03 





188E-03 





36 


4 





641E-03 





717E-03 





36 


5 





295E-02 





335E-02 





35 


6 





123E-01 





126E-01 





37 


7 





495E-01 





518E-01 





36 


8 





198E+00 





200E+00 





41 


9 





796E+00 





791E+00 





49 


10 





319E+01 





319E+01 





64 


11 





127E+02 





128E+02 


1 


13 


12 





509E+02 





511E+02 


2 


97 



Table lb. Wavelet contrast values for a D6 and a D8 on a . 

Amari error ~ .8, n< 





D8 indep 


D8 mixed 


cpu 








966E+00 





966E+00 





65 


1 





966E+00 





197E+01 





64 


2* 





914E+00 





333E+01 





65 


3 





446E-03 





409E-03 





64 


4 





220E-02 





214E-02 





64 


5 





932E-02 





104E-01 





63 


6 





388E-01 





383E-01 





63 


7 





157E+00 





160E+00 





64 


8 





628E+00 





630E+00 





71 


9 





253E+01 





252E+01 





84 


10 





101E+02 





101E+02 


1 


03 


11 





405E+02 





406E+02 


1 


53 


12 





162E+03 





162E+03 


3 


37 



niform density in dimension 2 under a half degree rotation 
•bs=100000, L=10, 



First, the Haar contrast is out of touch; at low resolution the mixing passes unnoticed 
because the observations stay in their original bins, and at high resolution, as for the other 
wavelets, any detection becomes impossible because the ratio 2 jd /n gets too big, and clearly 
wanders from the optimal rule of Prop. 4.5. 

Had we chosen a mixing with bigger Amari error, say 10, the Haar contrast would have 
worked at many more resolutions (this can be checked using the program icalettel); still, 
the Haar contrast is less likely to reach small Amari errors in a minimization process. 

For wavelets D4 and above, the contrast is able to capture the mixing effect especially at 
low resolution (resolution with largest relative increase marked) and up to j = 8. Also, the 
wavelet support technical constraint is apparent between D4 and D6 or D8. 

Finally we observe that the difference in computing time between Haar and a D8 is not 
significative in small dimension; it gets important starting from dimension 4 (Table 2). 
Note that the relatively longer cpu time for 2 3 < 2N - 1 is caused by the need to compute 
a circular shift for practically all points instead of only at borders. 



j 


D2 indep 


D2 mixed 


cpu 




j 


D4 indep 


D4 mixed 


cpu 





. 000E+00 


. 000E+00 


0.08 







0.625E-01 


0.625E-01 


0.85 


1 


0.100E-03 


0.155E-06 


0.05 




1 


0.624E-01 


0.304E+00 


0.83 


2 


0.411E-02 


0.221E-02 


0.05 




2 


0.283E-03 


0.331E-03 


0.82 


3 


0.831E-01 


0.684E-01 


0.05 




3 


0.503E-02 


0.453E-02 


0.83 


4 


0.132E+01 


0.129E+01 


0.08 




4 


0.818E-01 


0.824E-01 


0.92 


5 


0.210E+02 


0.210E+02 


0.29 




5 


0.130E+01 


0.133E+01 


1.30 


6 


. 336E+03 


. 335E+03 


3.62 




6 


0.211E+02 


0.211E+02 


4.68 



j 


D6 indep 


D6 mixed 


cpu 




j 


D8 indep 


D8 mixed 


cpu 





0.926E-01 


0.926E-01 


6.03 







. 934E+00 


. 934E+00 


22.8 


1 


0.927E-01 


0.929E-01 


6.01 




1 


0.934E+00 


0.364E+01 


22.8 


2 


0.884E-01 


. 825E+00 


6.01 




2 


0.937E+00 


0.111E+02 


22.8 


3 


0.725E-02 


0.744E-02 


6.07 




3 


0.751E-01 


0.751E-01 


22.9 


4 


0.122E+00 


0.117E+00 


6.40 




4 


0.124E+01 


0.117E+01 


24.1 


5 


0.193E+01 


0.195E+01 


7.51 




5 


0.196E+02 


0.196E+02 


27.0 


6 


0.311E+02 


0.311E+02 


11.0 




6 


0.313E+03 


0.313E+03 


30.8 



Table 2. Wavelet contrast values on a uniform density, dim=4 , nobs=50000, L=10, Amari error ~ .5 

Computation uses double precision, but single precision works just as well. There is no 
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guard against inaccurate sums that occur about 10% of the time for D4 and above, because 
it does not prevent a minimum contrast from detecting independence. Dyadic approxi- 
mation parameter L is set at octave 10, about three exact decimals, and shows enough. 
Cpu times, in seconds, correspond to the total of the projection time on V d and on the 
d Vj, added to the wavelet contrast computation time; machine used for simulations is a 
G4 l,5Mhz, with IGo ram; programs are written in fortran and compiled with IBM xlf 
(program icalettel to be found in Appendix). 

Contrast complexity 

By complexity we mean the length of do-loops. 

The projection of n observations on the tensorial space V d and the d margins for a Db(2N) 
has complexity 0(n(2N - l) d ). This is 0(n) for a Haar wavelet (2N=2) which boils down 
to making a histogram. The projection complexity is almost independent of j except for 
memory allocation. Once the projection at level j is known, the contrast is computed in 
0(2P d ). 

On the other hand, the complexity to apply one discrete wavelet transform at level j has 
complexity 0(2 jd (2N - l) d ). So we see that the filtering approach consisting in taking the 
Haar projection for a high ji (typically 2 3ld w and filter down to a lower j , as a 

shortcut to direct D2N moment approximation at level jo, is definitely a shortcut; except 
that the Haar wavelet carries with it a lack of sensitivity to small perturbations, which 
is a problem for empirical gradient evaluation or the detection of a small departure from 
independence. 

For comparison, the Hilbert-Schmidt independence criterion is theoretically computed in 
0(n 2 ) (Gretton et al. 2004 section 3), and the Kernel ICA criterion is theoretically com- 
puted in 0(n 3 d 3 ). In both cases, using incomplete Choleski decomposition and low-rank 
approximation of the Gram matrix, the complexity is brought down in practice to 0(nd 2 ) 
for HSIC and 0{n 2 logn) for Kernel ICA(Bach and Jordan 2002 p. 19). 

6. Contrast minimization 

The natural way to minimize the ICA contrast as a function of a demixing matrix W, is 
to whiten the signal and then carry out a steepest descent algorithm given the constraint 
l WW = Id, corresponding to W lying on the the Stiefel manifold S(d,d) = 0(d). In the ICA 
context, we can restrict to SO (d) c O(d) thus ignoring reflections that are not relevant. 

Needed material for minimization on the Stiefel manifold can be found in the paper of Arias 
et al. (1998). Another very close method uses the Lie group structure of SO(d) and the 
corresponding Lie algebra so(d) mapped together by the matrix logarithm and exponential 
(Plumbley, 2004). For convenience we reproduce here the algorithm in question, which is 
equivalent to a line search in the steepest descent direction in so(d): 

■ start at O e so(d), equivalent to / e SO(d); 

■ move about in so(d) from to — r\ V_bJ, where rj e M + corresponds to the minimum in 
direction V B J found by a line search algorithm, where V_b J = VJ*W— W*VJ is the gradient 
of J in so(d), and where VJ is the gradient of J in SO(d); 

■ use the matrix exponential to map back into SO(d), giving R = exp(-r)\7 bJ); 
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calculate W = RW e SO(d) and iterate. 



We reproduce below some typical runs (program icalette3), with a D4 and L = 10. Note 
that on example 2, the contrast cannot be usefully minimized because of a wrong resolution. 



d=3, 


j=3, n=30000 uniform 




d=3, 


j=5, n=30000 uniform 




d=3, 


j=3, n=10000 uniform 


it 


contrast 


amari 




it 


contrast 


amari 




it 


contrast 


amari 





0.127722 


65.842 







0.321970 


65 . 842 







0.092920 


42.108 


1 


0.029765 


15.784 




1 


0.321948 


65.845 




1 


0.035336 


14.428 


2 


0.002600 


2.129 




2 


0.321722 


65.999 




2 


. 007458 


3.392 


3 


0.001939 


0.288 




3 


0.321721 


65.999 




3 


0.006345 


1.684 


4 








4 








4 


0.006122 


1.109 


5 








5 








5 


0.006008 


0.675 




















d=4, 


j=2, n=10000 uniform 




d=3, 


j=4, n=30000 expone. 




d=3, 


j=3, n=10000 semici. 


it 


contrast 


amari 




it 


contrast 


amari 




it 


contrast 


amari 





0.025193 


22.170 







8.609670 


52.973 







0.041392 


35.080 


1 


0.010792 


9.808 




1 


5.101633 


48 . 744 




1 


0.029563 


22.189 


2 


0.003557 


4.672 




2 


0.778619 


16 . 043 




2 


0.007775 


5.601 


3 


0.001272 


1.167 




3 


0.017585 


3.691 




3 


0.006055 


3.058 


4 


0.001033 


0.502 




4 


0.008027 


2.262 




4 


0.005387 


2.261 


5 


0.000999 


0.778 




5 


0.006306 


1.542 




5 


0.005355 


1.541 



Table 3. Minimization examples at various j, d and n with D4 and L=10 



In our simulations, V J is computed by first differences; in doing so we cannot keep perturbed 
W orthogonality, and we actually compute a plain gradient in R dd . 

Again, a Haar contrast empirical gradient is tricky to obtain, since a small perturbation 
in W will likely result in an unchanged histogram at small j, whereas with D4 and above 
contrasts, response to perturbation is practically automatic and is anyway adjustable by 
the dyadic approximation parameter L. 

Below is the average of 100 runs in dimension 2 with 10000 observations, D4, j = 3 and 
L = 10 for different densities; start columns indicate Amari distance (on the scale to 100) 
and wavelet contrast on entry; it column is the average number of iterations. Note that 
for some densities after whitening we are already close to the minimum, but the contrast 
still detects a departure from independence; the routine exits on entry if the contrast or 
the gradient are too small, and this practically always correspond to an Amari distance less 
than 1 in our simulations. 



density 


Amari start 


Amari 


end 


cont 


start 


cont . end 


it. 


uniform 


53.193 





612 





509E-01 


. 104E-02 


1.7 


exponential 


32 . 374 





583 





616E-01 


0.150E-03 


1.4 


Student 


2.078 


1 


189 





534E-04 


0.188E-04 


0.1 


semi-circ 


51.401 


2 


760 





222E-01 


0.165E-02 


1.8 


Pareto 


4.123 





934 





716E-03 


0.415E-05 


0.3 


triangular 


46 . 033 


7 


333 





412E-02 


0.109E-02 


1.6 


normal 


45.610 


45 


755 





748E-03 


0.408E-03 


1.4 


Cauchy 


1.085 





120 





261E-04 


0.596E-06 


0.1 



Table 4. Average results of 100 runs in dimension 2, j=3 with a D4 at L=10 



These first results are comparable with the performance of existing ICA algorithms, as 
presented for instance in the paper of Jordan and Bach (2002) p. 30 (average Amari error 
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between 3 and 10 for 2 sources and 1000 observations) or Gretton et al (2004) table 2 
(average Amari error between 2 and 6 for 2 sources and 1000 observations). 

Finally we give other runs on the example of the uniform density at resolution j = 2 under 
different parameters settings, and relatively fewer number of observations. 



obs . 


dim 


L 


Amari start 


Amari end 


cont. start 


cont . end 


it . 


250 


2 


10 


47 


387 


38 


919 





279E-01 





193E-01 


2.4 


250 


2 


13 


47 


387 


32 


470 





279E-01 





170E-01 


2 . 2 


250 


2 


16 


47 


387 


17 


915 





279E-01 





603E-02 


2.3 


250 


2 


19 


47 


387 


19 


049 





279E-01 





598E-02 


2.6 


500 


2 


10 


51 


097 


20 


700 





246E-01 





106E-01 


2.1 


500 


2 


13 


51 


097 


6 


644 





246E-01 





398E-02 


2.2 


500 


2 


16 


51 


097 


21 


063 





246E-01 





109E-01 


2.1 


500 


2 


19 


51 


097 


14 


734 





246E-01 





839E-02 


2.4 


1000 


2 


10 


41 


064 


3 


533 





167E-01 





186E-02 


2.3 


1000 


2 


13 


41 


064 


3 


071 





167E-01 





190E-02 


2.1 


1000 


2 


16 


41 


064 


3 


518 





167E-01 





194E-02 


1.9 


1000 


3 


16 


49 


607 


15 


082 





405E-01 





127E-01 


4.8 


5000 


3 


10 


49 


575 


5 


405 





390E-01 





399E-02 


4.5 


5000 


3 


16 


49 


575 


1 


668 





390E-01 





960E-03 


4.7 


5000 


4 


10 


43 


004 


17 


036 





561E-01 





190E-01 


4.4 


5000 


5 


10 


38 


400 


29 


679 





800E-01 





559E-01 


4.1 


5000 


5 


16 


38 


400 


4 


233 





798E-01 





700E-02 


5.0 


5000 


6 


16 


42 


529 


10 


841 





114E+00 





278E-01 


4.9 


5000 


7 


16 


41 


128 


15 


761 





188E+00 





573E-01 


5.0 


5000 


8 


16 


39 


883 


14 


137 





286E+00 





743E-01 


5.0 



Table 5. Average results of 10 runs, j=2, with a D4, truncated at 5 iterations. 



One can see that raising the dyadic approximation parameter L tends to improve the 
minimization when the number of observations is "low" relatively to the number or cells 
2 jd , but that 500 observations in dimension 2 seems to be a minimum in the current state 
of the program. In higher dimensions, a higher number of observations is required, and in 
dimension 6 and above, 5000 is not enough at L=16. 

A visual example in dimension 2 

In dimension 2, we are exempted from any added complication brought by a gradient descent 
and Stiefel minimization. After whitening, the inverse of A is an orthogonal matrix, whose 
membership can be restricted to 50(2), ignoring reflections. So there is only one parameter 
9 to find to achieve reverse mixing. Since permutations of axes are also void operations 
in ICA, angles in the range to n/2 are enough to find out the minimum Wo which, right 
multiplied by N, will recover the ICA inverse of A. And A can be set to the identity matrix, 
because what changes when A is not the identity, but any invertible matrix, is completely 
contained in N. 

Figures below show the wavelet contrast in W and the amari distance d(A, WN) (where N 
is the matrix computed after whitening), functions of the rotation angle of the matrix W 
restricted to one period, [0,n/2]. The minimums are not necessarily at a zero angle, for 
precisely, mere whitening leaves the signal in a random rotated position (to reproduce the 
following results run the program icalette2). 

We see that, provided Amari error and wavelet contrast have coinciding minima, any line 
search algorithm will find the angle to reverse the mixing effect. We see also in Fig. 2 that 
the Haar wavelet contrast is perfectly suitable to detect independence, so that minimization 
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methods not gradient based might work very well in this case. 



On the example of the uniform density (Fig. 3) we have an illustration of a non smooth 
contrast variation typical of a too high resolution j given regularity and number of obser- 
vations. 



exponential_0_1 J= 6 Db= 4 nobs= 10000 Student_0_1_3 J= 2 Db= 2 nobs= 10000 




Fig.l. Exponential, D4, j=6, n=10000 Fig. 2. Student, D2, j=2, n=10000 




7. Appendix 



Lemma 7.2 (Approximation by the j-dependent U-statistic) 

Let (Xi . . . , X n ) be an i.i.d. sample of a random variable on R d and let p, a, m be positive integers 
verifying p + a = m. Let hj be a (possibly unsymmetric) kernel function on M. md defined as 

hj(x!, . . .,x m ) = $jk{xi) ■ ■ ■ Qjkixpfyjkt! o n ll (x p+ i) . . . ip jk e a o ir e '(x p+a ), 
with tt 1 the canonical projection on component I and $(x) = JJi =1 <p ° Tr e (x). 
Consider the associated U-statistic and Von Mises V statistic, 

Unj = - — —. - ^2 hj {X h , . . . , Xi m ), V n j = — hj (X^ , . . . , X im ). 
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The following relation holds: 

E\U nj V nj \ r = [—^) 0{n^-z). (11) 

In corollary, E\U no - V n j\ r = 0(n~ r ) for r = 1, 2 and E\U n j - V n j\ r — 0(n _1_ i) for r > 2 and 
2^ d P+^/n< 1. 

proof 

The first lines use the proof of the original lemma (see for example Serfling, 1980), with 
special care for unsymmetric kernels. 

Let W nj be the average of all terms hj(X il , . . . ,X im ) with at least one equality i a = i b for 
a ^ b and 1 < a, b < to; there are n m - A™ such terms. 

f2 denoting the set of n m unconstrained indexes, one has the relation 

(n m -A™)(W nj -U nj )= h(X il ,...,X i J-n m U nj + £ h(X h , . . . , X im ) 

= n m (V nj -U nj ) 

Hence, using Minkowski inequality and the fact that (n m - A™) = 0(n m_1 ) is positive, one 
obtains, 

E] A \U nj - V n j r = n- mr (n m A™f E? A \U nj - W nj \ r 

< n- mr (n m AZT (E n fA \U nj \ r + E] A \W nj \ r ) 
<0(n- r )(E] A \U nj \ r + E]jW nj \ r ). 

It remains to bound to right parenthesis. 
Using Minkowski inequality, one has, 

E]j\u n \ r <[A™]-i Yl ^ - i^c^x) ■ - ■ ^-fcC^)^-^ (^/^.j - ■ - ^- fc ^ )r 

= E]J\<p jkil (X^)\\..E]J^ jki 4X^)\ r E]J\^ jk (X 1 )\ r 

Next Eyjip jk e(X()\ r = 2i r / 2 f\ip(2i X - k l )\ r ff{x)dx < 2 J '(S- 1 )||/^|| 00 ||^||^ and by the same 
means, E] A \* ik (Xi)\ r < 2^(§" 1 ) ||/a||oo||*||? 

So that, 

E] A \U n Y< 2^(5-D [ H H/^ll J ||^||7 2^-^\\f A \a\H7 

Likewise for W n one obtains terms of the type E r f A ip jk £ 1 (X^ 1 ) . . . (p jk e K (X^) of which E1 A ®{X{) 
is one particular form, which are bounded exactly in the same way, and produce the same 
power of V . a 
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